Active Sampler: Light-weight Accelerator for Complex Data Analytics at Scale

نویسندگان

  • Jinyang Gao
  • H. V. Jagadish
  • Beng Chin Ooi
چکیده

Recent years have witnessed amazing outcomes from “Big Models” trained by “Big Data”. Most popular algorithms for model training are iterative. Due to the surging volumes of data, we can usually afford to process only a fraction of the training data in each iteration. Typically, the data are either uniformly sampled or sequentially accessed. In this paper, we study how the data access pattern can affect model training. We propose an Active Sampler algorithm, where training data with more “learning value” to the model are sampled more frequently. The goal is to focus training effort on valuable instances near the classification boundaries, rather than evident cases, noisy data or outliers. We show the correctness and optimality of Active Sampler in theory, and then develop a light-weight vectorized implementation. Active Sampler is orthogonal to most approaches optimizing the efficiency of large-scale data analytics, and can be applied to most analytics models trained by stochastic gradient descent (SGD) algorithm. Extensive experimental evaluations demonstrate that Active Sampler can speed up the training procedure of SVM, feature selection and deep learning, for comparable training quality by 1.6-2.2x.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Extending Database Accelerators for Data Transformations and Predictive Analytics

The IBM DB2 Analytics Accelerator (IDAA) integrates the strong OLTP capabilities of DB2 for z/OS with very fast processing of OLAP workloads using Netezza technology. The accelerator is attached to DB2 as analytical processing resource – completely transparent for user applications. But all data modifications must be carried out by DB2 and are replicated to the accelerator internally. However, ...

متن کامل

In situ Visualization and Analysis of Ion Accelerator Simulations using Warp and VisIt

The generation of short pulses of ion beams through the interaction of an intense laser with a plasma sheath offers the possibility of compact and cheaper ion sources for many applications; from fast ignition and radiography of dense targets to hadron therapy and injection into conventional accelerators. To enable the efficient analysis of large-scale, high-fidelity particle accelerator simulat...

متن کامل

Analytics-Driven Lossless Data Compression for Rapid In-situ Indexing, Storing, and Querying

The analysis of scientific simulations is highly data-intensive and is becoming an increasingly important challenge. Peta-scale data sets require the use of light-weight query-driven analysis methods, as opposed to heavy-weight schemes that optimize for speed at the expense of size. This paper is an attempt in the direction of query processing over losslessly compressed scientific data. We prop...

متن کامل

Application of Big Data Analytics in Power Distribution Network

Smart grid enhances optimization in generation, distribution and consumption of the electricity by integrating information and communication technologies into the grid. Today, utilities are moving towards smart grid applications, most common one being deployment of smart meters in advanced metering infrastructure, and the first technical challenge they face is the huge volume of data generated ...

متن کامل

Big Data Analytics and Now-casting: A Comprehensive Model for Eventuality of Forecasting and Predictive Policies of Policy-making Institutions

The ability of now-casting and eventuality is the most crucial and vital achievement of big data analytics in the area of policy-making. To recognize the trends and to render a real image of the current condition and alarming immediate indicators, the significance and the specific positions of big data in policy-making are undeniable. Moreover, the requirement for policy-making institutions to ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/1512.03880  شماره 

صفحات  -

تاریخ انتشار 2015